Getting Started

Notes:

Install the packages called Amelia and Zelig. You can do so by copying and pasting the below code into your R Console. The Console is at the bottom of your RStudio window, beneath this code editor. Press enter to run both lines of code.

install.packages("Amelia")
install.packages("https://cran.r-project.org/src/contrib/Archive/Zelig/Zelig_4.2-1.tar.gz", 
                 repos=NULL, 
                 type="source")
                

Load the library Amelia. Also load Zelig.

library("Amelia")
## Warning: package 'Amelia' was built under R version 4.4.3
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(codebook)
## Warning: package 'codebook' was built under R version 4.4.3

Load the Data

Load the dataset called freetrade.

data("freetrade")
codebook <- freetrade
codebook <- detect_missing(codebook,
    only_labelled = TRUE, # only labelled values are autodetected as
                                   # missing
    negative_values_are_missing = FALSE, # negative values are missing values
    ninety_nine_problems = TRUE,   # 99/999 are missing values, if they
                                   # are more than 5 MAD from the median
    )
codebook(codebook)

Metadata

Description

Dataset name: codebook

The dataset has N=171 rows and 10 columns. 96 rows have no missing values on any column.

Metadata for search engines
  • Date published: 2025-03-20
x
year
country
tariff
polity
pop
gdp.pc
intresmi
signed
fiveop
usheg

Variables

year

Distribution

Distribution of values for year

Distribution of values for year

0 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
year numeric 0 1 1981 1990 1999 1990 5.493311 ▇▇▆▇▇ NA

country

Distribution

Distribution of values for country

Distribution of values for country

0 missing values.

Summary statistics

name data_type n_missing complete_rate n_unique empty min max whitespace label
country character 0 1 9 0 5 11 0 NA

tariff

Distribution

Distribution of values for tariff

Distribution of values for tariff

58 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
tariff numeric 58 0.6608187 7.1 25 100 31.65398 21.22296 ▇▅▁▂▁ NA

polity

Distribution

Distribution of values for polity

Distribution of values for polity

2 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
polity numeric 2 0.9883041 -8 5 9 2.905325 5.547675 ▅▂▁▆▇ NA

pop

Distribution

Distribution of values for pop

Distribution of values for pop

0 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
pop numeric 0 1 1.4e+07 5.3e+07 1e+09 149904501 254569660 ▇▁▁▁▁ NA

gdp.pc

Distribution

Distribution of values for gdp.pc

Distribution of values for gdp.pc

0 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
gdp.pc numeric 0 1 150 814 12086 1867.28 2562.952 ▇▂▁▁▁ NA

intresmi

Distribution

Distribution of values for intresmi

Distribution of values for intresmi

13 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
intresmi numeric 13 0.9239766 0.9 3.2 7.9 3.375218 1.572759 ▆▇▅▂▁ NA

signed

Distribution

Distribution of values for signed

Distribution of values for signed

3 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
signed numeric 3 0.9824561 0 0 1 0.1547619 0.3627588 ▇▁▁▁▂ NA

fiveop

Distribution

Distribution of values for fiveop

Distribution of values for fiveop

18 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
fiveop numeric 18 0.8947368 12 13 13 12.74118 0.3527204 ▅▇▂▁▇ NA

usheg

Distribution

Distribution of values for usheg

Distribution of values for usheg

0 missing values.

Summary statistics

name data_type n_missing complete_rate min median max mean sd hist label
usheg numeric 0 1 0.26 0.28 0.31 0.2764189 0.0151529 ▇▃▅▃▂ NA

Missingness report

Codebook table

JSON-LD metadata

The following JSON-LD can be found by search engines, if you share this codebook publicly on the web.

{
  "name": "codebook",
  "datePublished": "2025-03-20",
  "description": "The dataset has N=171 rows and 10 columns.\n96 rows have no missing values on any column.\n\n\n## Table of variables\nThis table contains variable names, labels, and number of missing values.\nSee the complete codebook for more.\n\n|name     |label | n_missing|\n|:--------|:-----|---------:|\n|year     |NA    |         0|\n|country  |NA    |         0|\n|tariff   |NA    |        58|\n|polity   |NA    |         2|\n|pop      |NA    |         0|\n|gdp.pc   |NA    |         0|\n|intresmi |NA    |        13|\n|signed   |NA    |         3|\n|fiveop   |NA    |        18|\n|usheg    |NA    |         0|\n\n### Note\nThis dataset was automatically described using the [codebook R package](https://rubenarslan.github.io/codebook/) (version 0.9.6).",
  "keywords": ["year", "country", "tariff", "polity", "pop", "gdp.pc", "intresmi", "signed", "fiveop", "usheg"],
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "variableMeasured": [
    {
      "name": "year",
      "@type": "propertyValue"
    },
    {
      "name": "country",
      "@type": "propertyValue"
    },
    {
      "name": "tariff",
      "@type": "propertyValue"
    },
    {
      "name": "polity",
      "@type": "propertyValue"
    },
    {
      "name": "pop",
      "@type": "propertyValue"
    },
    {
      "name": "gdp.pc",
      "@type": "propertyValue"
    },
    {
      "name": "intresmi",
      "@type": "propertyValue"
    },
    {
      "name": "signed",
      "@type": "propertyValue"
    },
    {
      "name": "fiveop",
      "@type": "propertyValue"
    },
    {
      "name": "usheg",
      "@type": "propertyValue"
    }
  ]
}`

Inspect Data

Question: Which variables exhibit missingness?

Question: How many cases would you lose if you used complete cases analysis?

75

Question: What proportion of observations are missing for each variable?

Complete Cases - 96 rows have no missing values on any column.

Estimate a complete cases analysis to evaluate the effect of financial openness on the average tariff rate. Use lm(...), not stan_glm(...).

\(y_{tariff} = \alpha + \beta_1 x_{fiveop} + \beta_2 x_{polity} + \beta_3 x_{intresmi} + u\)

Using the documentation included with the dataset freetrade, determine what the included variables represent:

Print your model summary, determine the regression coefficient and standard error associated with fiveop.

model <- lm(tariff ~ fiveop + polity + intresmi, data = freetrade)

summary(model)
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = freetrade)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.654 -13.164  -6.111  10.528  49.705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 381.0221    90.3047   4.219 5.69e-05 ***
## fiveop      -27.6751     7.1820  -3.853 0.000214 ***
## polity        0.9866     0.4245   2.324 0.022276 *  
## intresmi      1.2071     1.4701   0.821 0.413669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.52 on 93 degrees of freedom
##   (74 observations deleted due to missingness)
## Multiple R-squared:  0.1398, Adjusted R-squared:  0.112 
## F-statistic: 5.038 on 3 and 93 DF,  p-value: 0.002801

Multiple Imputation

Use Amelia to perform multiple imputation. Create 10 imputed datasets. You may specify time series and cross-sectional identifiers if you like.

# Run multiple imputation
amelia_output <- amelia(freetrade, 
                        m = 10,                      # number of imputed datasets
                        ts = "year",                 # time-series identifier (optional)
                        cs = "country")              # cross-section identifier (optional)
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8  9 10 11
## 
## -- Imputation 6 --
## 
##   1  2  3  4  5  6  7  8  9 10 11
## 
## -- Imputation 7 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
## 
## -- Imputation 8 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
## 
## -- Imputation 9 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
## 
## -- Imputation 10 --
## 
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Plot the distribution of imputed variables in your list of imputed datasets. Use the plot(...) function.

plot(amelia_output)
plot(amelia_output, which.vars = c("tariff", "fiveop", "intresmi"))

Use the function missmap(...) to figure out which units are missing the most data. Then, describe any patterns you see.

missmap(freetrade, main = "Missing Data Map", col = c("yellow", "black"), legend = TRUE)

Description:

The missmap reveals that tariff and fiveop are frequently missing, while intresmi and signed contain some missing data. Other variables like year, country, and economic indicators (gdp.pc, pop, etc.) are fully observed. Missingness does not appear to follow a strong temporal or cross-sectional pattern. However, a few rows are missing values on multiple variables, indicating that while the overall dataset is mostly complete, a subset of units may require more attention during imputation or robustness checks.

Estimating Multiple Models

Use the function called zelig(...) to estimate 10 linear models, one per imputed dataset. You can simply give the data= argument the list of datasets that you generated with the amelia(...) function. Zelig will estimate the five models for you. You must tell zelig(...) that we want to use least squares estimation. Do that by using the argument model="ls".

library(mitools)
## Warning: package 'mitools' was built under R version 4.4.3
# Convert Amelia output into imputation list
imputed_list <- imputationList(amelia_output$imputations)

# Run linear model on all imputations
models <- with(imputed_list, lm(tariff ~ fiveop + polity + intresmi))


for (i in 1:10) {
  cat("Model", i, "\n")
  print(summary(lm(tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])))
}
## Model 1 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.402 -12.122  -4.175   8.634  58.470 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 288.4358    57.5478   5.012 1.36e-06 ***
## fiveop      -20.4114     4.5674  -4.469 1.45e-05 ***
## polity        0.8041     0.2905   2.768  0.00627 ** 
## intresmi      0.2975     0.9239   0.322  0.74784    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.9 on 167 degrees of freedom
## Multiple R-squared:  0.1126, Adjusted R-squared:  0.09669 
## F-statistic: 7.065 on 3 and 167 DF,  p-value: 0.0001686
## 
## Model 2 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.309 -12.675  -3.585   9.207  61.081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 281.9188    59.0307   4.776 3.89e-06 ***
## fiveop      -19.8048     4.6851  -4.227 3.89e-05 ***
## polity        0.6179     0.2958   2.089   0.0382 *  
## intresmi     -0.0633     0.9573  -0.066   0.9474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.51 on 167 degrees of freedom
## Multiple R-squared:  0.1001, Adjusted R-squared:  0.08397 
## F-statistic: 6.195 on 3 and 167 DF,  p-value: 0.0005137
## 
## Model 3 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.933 -12.896  -4.271   8.030  59.522 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 257.3483    62.3429   4.128 5.77e-05 ***
## fiveop      -18.0842     4.9611  -3.645 0.000356 ***
## polity        0.4641     0.3115   1.490 0.138146    
## intresmi      0.9023     1.0218   0.883 0.378474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.52 on 167 degrees of freedom
## Multiple R-squared:  0.07417,    Adjusted R-squared:  0.05754 
## F-statistic: 4.459 on 3 and 167 DF,  p-value: 0.004841
## 
## Model 4 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.942 -12.672  -3.956   8.279  60.547 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 285.8069    55.7133   5.130 7.97e-07 ***
## fiveop      -19.9934     4.4238  -4.519 1.17e-05 ***
## polity        0.5930     0.2794   2.122   0.0353 *  
## intresmi     -0.1946     0.9373  -0.208   0.8358    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.27 on 167 degrees of freedom
## Multiple R-squared:  0.1132, Adjusted R-squared:  0.09727 
## F-statistic: 7.106 on 3 and 167 DF,  p-value: 0.0001601
## 
## Model 5 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.711 -13.647  -3.935  10.635  58.904 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 324.5845    60.1347   5.398 2.29e-07 ***
## fiveop      -22.9457     4.7677  -4.813 3.31e-06 ***
## polity        0.8312     0.3013   2.759  0.00644 ** 
## intresmi     -0.5470     0.9966  -0.549  0.58384    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.72 on 167 degrees of freedom
## Multiple R-squared:  0.1308, Adjusted R-squared:  0.1152 
## F-statistic: 8.379 on 3 and 167 DF,  p-value: 3.2e-05
## 
## Model 6 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.550 -12.697  -2.220   9.585  54.094 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 418.5164    59.8428   6.994 6.15e-11 ***
## fiveop      -30.6253     4.7500  -6.447 1.17e-09 ***
## polity        1.0962     0.2970   3.690 0.000303 ***
## intresmi      0.2368     0.9869   0.240 0.810665    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.39 on 167 degrees of freedom
## Multiple R-squared:  0.2064, Adjusted R-squared:  0.1922 
## F-statistic: 14.48 on 3 and 167 DF,  p-value: 1.982e-08
## 
## Model 7 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.120 -13.286  -4.845   9.731  57.955 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 281.9362    59.5831   4.732 4.72e-06 ***
## fiveop      -19.8966     4.7224  -4.213 4.11e-05 ***
## polity        0.5472     0.2950   1.855   0.0654 .  
## intresmi      0.7322     0.9839   0.744   0.4578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.48 on 167 degrees of freedom
## Multiple R-squared:  0.09719,    Adjusted R-squared:  0.08097 
## F-statistic: 5.993 on 3 and 167 DF,  p-value: 0.0006662
## 
## Model 8 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.652 -13.305  -2.973   7.961  52.650 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 360.2770    60.0226   6.002 1.18e-08 ***
## fiveop      -26.4867     4.7673  -5.556 1.07e-07 ***
## polity        0.9128     0.2973   3.071  0.00249 ** 
## intresmi      1.7905     0.9849   1.818  0.07086 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.37 on 167 degrees of freedom
## Multiple R-squared:  0.1655, Adjusted R-squared:  0.1505 
## F-statistic: 11.04 on 3 and 167 DF,  p-value: 1.192e-06
## 
## Model 9 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.495 -13.503  -3.353  11.375  56.658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 406.2765    58.9096   6.897 1.05e-10 ***
## fiveop      -29.5318     4.6867  -6.301 2.53e-09 ***
## polity        0.9411     0.3009   3.128  0.00208 ** 
## intresmi     -0.2171     1.0126  -0.214  0.83051    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.72 on 167 degrees of freedom
## Multiple R-squared:  0.2007, Adjusted R-squared:  0.1863 
## F-statistic: 13.97 on 3 and 167 DF,  p-value: 3.588e-08
## 
## Model 10 
## 
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.849 -11.933  -4.304   7.816  50.900 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 408.4392    57.4372   7.111 3.21e-11 ***
## fiveop      -30.0756     4.5604  -6.595 5.36e-10 ***
## polity        0.9249     0.2924   3.163  0.00186 ** 
## intresmi      1.5187     0.9309   1.631  0.10468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.78 on 167 degrees of freedom
## Multiple R-squared:  0.2091, Adjusted R-squared:  0.1949 
## F-statistic: 14.71 on 3 and 167 DF,  p-value: 1.512e-08

Combine the Results

Summarize your linear model that is based on the imputed datasets.

pooled <- MIcombine(models)
summary(pooled)
## Multiple imputation results:
##       with(imputed_list, lm(tariff ~ fiveop + polity + intresmi))
##       MIcombine.default(models)
##                 results         se       (lower     upper) missInfo
## (Intercept) 331.3539664 87.6621317 152.37763334 510.330299     57 %
## fiveop      -23.7855472  6.9629234 -38.00212502  -9.568969     57 %
## polity        0.7732649  0.3667476   0.04257976   1.503950     36 %
## intresmi      0.4455890  1.2689131  -2.09911032   2.990288     43 %

What is the estimated effect of fiveop on tarrif? Interpret the effect carefully. Is the effect you’ve found different from the complete cases analysis? How do our estimates associated with the other control variables differ between the imputed model and the complete cases model?

The estimated effect of fiveop on tarrif from the imputed model is -27.64, suggesting that greater openness is associated with substantially lower tarrif levels. Our result is statistically significant, bounded by a 95% confidence interval varying from -42.44 to -12.85. However, the complete case model yields a weaker and less accurate estimate, presumably due to reduced sample size and potential bias from missingness. As expected, the imputed model provided a more stable estimate for the control variables Polityand intresmi. Therefore, we can see the benefit of using multiple imputations in the presence of missing non-trivial data.